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Abstract 

Oligomers of length k, or /c-mers, are convenient and widely used features for modeling the properties and functions of DNA 
and protein sequences. However, fc-mers suffer from the inherent limitation that if the parameter k is increased to resolve 
longer features, the probability of observing any specific fc-mer becomes very small, and /c-mer counts approach a binary 
variable, with most /c-mers absent and a few present once. Thus, any statistical learning approach using /c-mers as features 
becomes susceptible to noisy training set /c-mer frequencies once k becomes large. To address this problem, we introduce 
alternative feature sets using gapped /c-mers, a new classifier, gkm-SVIVl, and a general method for robust estimation of k- 
mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data 
structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, 
our gkm-SVIVl predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved 
accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer- 
SVM on human ENCODE ChlP-seq datasets, and further demonstrate the general utility of our method using a Naive-Bayes 
classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence 
classification problem. 
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Introduction 

Predicting the function of regulatory element.s from primary 
DNA sequence still remains a major problem in computational 
biology. These elements typically contain combinations of several 
binding sites for regulatory factors whose activity together specifies 
the developmental times, cell-types, or environmental signals in 
which the element wUl be active. Genetic variation in regulatory 
elements is increasingly thought to play a significant role in the 
etiology and heritabUity of common diseases, and surveys of 
Genome Wide Association Studies have highlighted the prepon- 
derance of significant variants in regulatory DNA [1,2]. An 
accurate computational model to predict regxilatory elements can 
1) help identify and link core sets of regulatory factors with specific 
diseases, and 2) predict the functional consequences of variation or 
mutations in specific sites within regulatory elements. 

We have recently introduced a successful method for regulatory 
DNA sequence prediction, kmer-SVM, which uses combinations 
of short (6-8 bp) k-mer frequencies to predict the activity of larger 
functional genomic sequence elements, typically ranging from 500 
to 2000 bp in length [3]. An advantage of A-mer based approaches 
relative to the alternative position weight matrix (PWM) approach 
is that PWMs can require large amounts of data to optimize and 



determine appropriate scoring thresholds [4,5], while A-mers are 
simple features which are either present or absent. However, in 
our previous implementation of the kmer-SVM [3], the choice to 
use a single k, and which k, is somewhat arbitrary and based on 
performance on a limited selection of datasets. A major 
contribution of the present work is an extension of this single k 
approach to include longer and much more general sequence 
features. The function of these DNA regulatory elements is 
generally thought to be specified at the molecular level by the 
binding of combinations of Transcription Factors (TFs) or other 
DNA binding regulatory factors, and many of these binding sites 
are short and fall within the range of A (6-8) where our kmer-SVM 
approach was successful. However, Transcription Factor Binding 
Sites (TFBS) can vary from 6-20 bp, so some are much longer 
(such as ABFl, CTCF, etc.), and thus cannot be completely 
represented by the short /i-mers. Alternatively, TFBS can be 
defined by a set of sequences with some gaps (non-informative 
positions) as each given DNA sequence has some binding affinity 
for the TF. Although the kmer-SVM method can model TFBS 
longer than k by tiling across TFBS with overlapping A-mers, this 
loses some spatial information in the binding site, and overall 
classification accuracy can be significantly impaired when long 
TFBS are important predictive features [6]. 
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Author Summary 

Genomic regulatory elements (enhancers, promoters, and 
insulators) control the expression of their target genes and 
are widely believed to play a key role in human 
development and disease by altering protein concentra- 
tions. A fundamental step in understanding enhancers is 
the development of DNA sequence-based models to 
predict the tissue specific activity of regulatory elements. 
Such models facilitate both the identification of the 
molecular pathways which impinge on enhancer activity 
through direct transcription factor binding, and the direct 
evaluation of the impact of specific common or rare 
genetic variants on enhancer function. We have previously 
developed a successful sequence-based model for en- 
hancer prediction using a /c-mer support vector machine 
(kmer-SVIVl). Here, we address a significant limitation of the 
kmer-SVM approach and present an alternative method 
using gapped k-mers (gkm-SVM) which exhibits dramati- 
cally improved accuracy in all test cases. While we focus on 
enhancers and transcription factor binding, our method 
can be applied to improve a much broader class of 
sequence analysis problems, including proteins and RNA. 
In addition, we expect that most /c-mer based methods can 
be significantly Improved by simply using the generalized 
/c-mer count method that we present In this paper. We 
believe this improved model will enable significant 
contributions to our understanding of the human regula- 
tory system. 

Naively one could addre.ss this is.sue by using longer k's or 
combinations of k-mers spanning the expected size range of TFBS, 
but a major limitation of this approach is that longer k-mers 
generate extremely sparse feature vectors (i.e. most k-mers simply 
do not appear in a training sequence and thus receive zero counts, 
or appear only once), which causes a severe overfitting problem 
even at quite moderate k. Therefore, the original kmer-SVM was 
limited in practice to k-mer lengths from 6 to 10, with performance 
already degrading at ^ = 9 or 10, depending on the dataset. Thus 
in practice, the parameter k was chosen by a tradeoff between 
resolving longer features and robust estimation of their frequen- 
cies. 

We recentiy introduced gapped k-mers as a way to resolve this 
fundamental limitation with k-mer features and showed that they 
can be used to more robusdy estimate k-mer frequencies in real 
biological sequences [7]. In this paper, we present a simple and 
efficient method for calculation of the robust k-mer count 
estimates. We also expand our kmer-SVM method [3] to use 
gapped k-mers or robust k-mer count estimates as feature sets and 
present efficient methods to compute these new kernels. We show 
that our new method, gkm-SVM, consistently and significandy 
outperforms a kmer-SVM using both CTCF and EP300 genomic 
bound regions over a wide range of varying feature lengths. 
Furthermore, we show that, while kmer-SVM suffers significandy 
from overfitting as k is increased, gkm-SVM performance is only 
very modestly affected by changes in the chosen feature length 
parameters. Next, we systematically compare the two approaches 
on the complete human ENCODE ChlP-seq data sets, and show 
that gkm-SVM either significandy outperforms or is comparable to 
kmer-SVM in all cases. Of biological interest, on the ENCODE 
ChlP-seq data sets, we also show that gkm-SVM outperforms the 
best known single PWM by detecting necessary co-factors. We also 
systematically compare gkm-SVM to similar earlier SVM 
approaches [8-11], and show that they perform comparably for 
optimal parameters in terms of accuracy, but that gkm-SVM is less 



sensitive to parameter choice and is computationally more 
efficient. To further demonstrate the more general utility of the 
k-mer count estimates, we apply them in a simple Naive-Bayes 
classifier, and show that using k-mer count estimates instead ot k- 
mer counts consistently improves classification accuracy. Since our 
proposed method is general, we anticipate that many other 
sequence classification problems will also benefit from using these 
features. For example, word based methods can also be used to 
detect functional motifs in protein sequences, where the length of 
the functional domain is unknown [12]. 

Results 

Calculation of sequence similarity score using gapped k- 
mers 

To overcome the limitations associated with using k-mers as 
features described above, we introduce a new method called gkm- 
SVM, which uses as features a fuU set of A-mers with gaps. At the 
heart of most classification methods is a distance or similarity 
score, often called a kernel function in the SVM context, which 
calculates the similarity between any two elements in the chosen 
feature space. Therefore, in this section, we first describe the 
feature set and how to efficiendy calculate the similarity score. 
This new feature set, c^Llied gapped k-mers, is characterized by two 
parameters; (1) Z, the whole word length including gaps, and (2) k, 
the number of informative, or non-gapped, positions in each word. 
The number of gaps is thus I - k. 

We first define a feature vector for a given sequence S to be 
f^= \}'],yi, ■ ■ ■ ,>'ifl^, where M is the number of all gapped k- 

mers (i.e. for DNA sequences, Af=^^^4*), and yf's are the 

counts of the corresponding gapped k-mers appeared in 
the sequence S. We then define a similarity score, or a 
kernel function, between two sequences. Si and S2, as the 
normalized inner product of the corresponding feature vectors as 
follows: 



K(SuS2)-- 



(1) 



where </^l > = , (V, ' '^^PX and \\f' \\ = ^/ifW) ■ 
Therefore, the similarity score, K(Si, S2), is always between 0 
and 1, and K{S, S) is equal to 1. We will refer to Equation (1) as the 
gkm-kernel. It is similar to the wildcard kernel introduced in Ref 
[9], but our approach differs in that we do not sum over the 
number of wild-cards, or gaps, as formulated in Ref [9] . 

Since the number of all possible gapped A-mers grows extremely 
rapidly as k increases, direct calculation of Equation (1) quickly 
becomes intractable. To implement gapped ^-mers as features, it is 
necessary to overcome this serious issue, by deriving a new 
equation for K(Si, S2) that does not involve the computation of all 
possible gapped k-mer counts. The key idea is that only the full l- 
mers present in the two sequences can contribute to the similarity 
score via aU gapped k-mers derived from them. Thus the inner 
product in Equation (1), which involves a sum over aU gapped k- 
mers, can be computed by a much more compact sum, which 
involves only a double sum over the sequential l-mers present in 
each of the two sequences: 



'=1 /=i 



(2) 
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where m,- ' is the j'th l-mer appearing in Si, u^^ is the/th l-mer 
appearing in and Mi and 71-2 are the numbers of full /-mers in Si 
and S2 respectively, i.e. «i = length(Si)— /+1 and n2 = length(S2) — 
l+l. Evaluation of Equation (2) is much more efficient than 
Equation (1) because almost always, n\n2<^M. As wiU be shown 
below, /jft(Mi, M2) only depends on the number of mismatches, m, 
between the two full /-mers, Mi and M2, i.e. hik^\, i^'ij — hikin). 
Therefore, we can rewrite Equation (2) by grouping all the /-mer 
pairs of the same number of mismatches together as follows: 

/ 

^s, j-Siy^Y^ N,„{S^ ,S2)h,kim) (3) 

m = 0 

where N^{Si, S2) is the number of pairs of /-mers with m 
mismatches, and hii,{m) is the corresponding coefficient. We refer 
to N,„(Si, S2) as the mismalch profile of Si and 82- Since each /-mer 



pair with m mismatches contributes to 




common gapped 



^-mers, the coefficient hik{m), denoted in short by h^, is given by: 

A.=(Cr) (4) 

1 0 otherwise 

Determining a mismatch profile in Equation (3) is still 
computationally challenging since the numbers of mismatches 
between all possible /-mer pairs has yet to be determined. To 
address this issue, we have developed two different algorithms. We 
first considered direct evaluation of the mismatch profiles between 
all pairs of training sequences. To minimize the cost of counting 
mismatches between two words, we develop an efficient mismatch 
counting algorithm that pra[:ti[:ally runs in constant time, 
independent of k and / parameters (see Methods). We then use 
Equation (3) to obtain the inner products for every pair of 
sequences. 

The direct and sequential evaluation of the kernel function 
between all training sequences becomes less practical as the 
number of training sequences gets larger, since it requires 0(N'^L^) 
operations of mismatch counting between /-mer pairs, where A'^ is 
the number of training sequences and L is the average sequence 
length. Because of this unfavorable scaling, we implemented an 
alternative method using a k-mer tree data structure, similar to one 
previously introduced in Ref [8] , but with some modifications (see 
Methods). This method simultaneously calculates the mismatch 
profile for all the sequence pairs, and, therefore, can significandy 
reduce the computation time especially when the number of gaps 
is relatively small, typically when I - k< — 4. We can further 
improve the efficiency if we truncate the sum in Equation (3) to 
only consider up to a maximum numb(^r of mismatches, (see 
Methods). This approximati; method is ('specially favorable when 
the number of gaps is large, but the efficiency comes at the cost of 
exact evaluation of the kernel and classification accuracy, which 
we win discuss in greater detail below. Therefore, we used one of 
the two algorithms depending on the size of data sets and the 
number of gaps we choose for analysis. 

Gapped /c-mer SVM classifier outperforms /c-mer SVM 
classifier 

Because of the difficulty of reliably estimating long /{-mer 
counts, we hypothesized that gkm-SVM would perform better 
than kmer-SVM, and that gapped k-mers would be most 
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advantageous as features, when long TFBSs are important 
sequence elements in a given data set. To direcdy test this idea, 
we compared the classification performance of gkm-SVM to kmer- 
SVM in predicting the binding sites of CTCF [13] in the human 
genome, a TF whose binding specificity has been well-character- 
ized [14]. As shown in Figure SI, CTCF recognizes very long 
DNA sequences (the full PWM is 19 bp), and the genomic CTCF 
bound regions are almost perfecdy predicted by matches to the 
CTCF PWM (Figure S2): in the PWM analysis, we used as a 
predictor the best matching log-odd score to the PWM model in 
the region, and achieved area under th(' ROC cur\'e (AUC) of 
0.983. It is very rare for a single PWM to perform this well, and in 
our experience CTCF is unique in this regard. The CTCF dataset 
therefore provides an excellent opportunity to test our gapped k- 
mer classifier. We used the top 2,500 CTCF ChlP-seq signal 
enriched regions in the GM 12878 cell line available at Gene 
Expression Omnibus (GSE19622) [13] as a positive dataset, and 
equal numbers of random genomic sequences (1 x) as a negative 
dataset. We generated these negative sequences by matching 
length, GC and repeat fraction of the positive set [6]. 

We compared the performance of gkm-SVM and kmer-SVM 
on the CTCF data set for a range of oligomer lengths by varying 
either k (for kmer-SVM) or / (for gkm-SVM) from 6 to 20. We 
fixed the parameter A = 6 for gkm-SVM. We then quantified the 
classification performance of each by calculating test-set AUC with 
standard five-fold cross validation (CV) (see Methods). Figure lA 
shows a summary of the comparisons. As anticipated, gkm-SVM 
performs consistently better than kmcr-S\'M for all lengths. More 
significandy, while kmer-SVM suffers severely from overfitting 
when k is greater than 10, gkm-SVM is virtually unaffected by /. In 
fact, gkm-SVM achieves the best result (AUC = 0.967) when / = 14 
and k = 6, which is significantly better than the kmer-SVM 
(AUC = 0.912 when A =10); the best ROC curve is shown in 
Figure IC. It should be noted, however, that the PWM 
classification result (Figure- S2) is still the best (AUC = 0.983) 
among the three methods we tested in this analysis. A complicating 
factor is that while both kmer-SVM and gkm-SVM use entire 
sequences (average length is 316 bp) to calculate the prediction 
scores, the PWM scores are from the best matching 19 bp sub- 
sequence in the region. It may be that the extra ~300 bp 
sequences contribute noise in the SVM prediction scores, which 
slightiy impairs the overall classification accuracy. In any event, 
the gkm-SVM is a significant improvement in accuracy over the 
kmer-SVM, and both gkm-SVM and the PWM are excellent 
predictors on this dataset. 

Interestingly, gkm-SVM shows consistently better performance 
than kmer-SVM even if / is relatively small (/<10) (Figure lA). 
This suggests that gkm-SVM may also be better at modeling 
diverse combinations of TFBSs than kmer-SVM. To test this 
hypothesis, we analyzed a mouse enhancer dataset of more varied 
sequence c:omposition: genomic EP30() bound regions in embry- 
onic mouse forebrain [15]. We have pre\'iously shown that our 
original kmer-SVM classifiers can accurately predict EP300 
binding when mediated by sets of active TFBSs [3] . This EP300 
data set thus provides a direct test of the effectiveness of using 
gapped k-mer features to detect more complex regulatory features. 
For this analysis, we defined a new set of the 1,693 400 bp sites 
that maximize the EP300 ChlP-seq signal within each of the peaks 
determined by MACS [16] after removing any regions which were 
more than 70% repeats. We repeated the k and I scaling with the 
EP300 data set and a 1 x negative set, and again found that gkm- 
SVM consistendy outperforms kmer-SVM for all feature lengths 
(Figure IB). Analogous to the observations modeling CTCF 
binding, gkm-SVM AUC is high and does not degrade with large 
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Figure 1. gkm-SVM outperforms kmer-SVM over a wide range of Ar-mer lengtli. Both gkm-SVM and kmer-SVM were trained on (A) CTCF 
bound and (B) EP300 bound genomic regions using different word lengths {k for kmer-SVM and / for gkm-SVM). The parameter k for gkm-SVM was 
fixed at 6. While AUCs of the kmer-SVMs show significant overfitting in both cases as k gets larger (dotted), gkm-SVMs accuracy is higher for a broad 
range of larger / (solid). Results using the truncated Gkm-SVM with m^^^ = 3 are shown as dashed lines and AUCs of these faster approximations are 
comparable when the difference between m^ax and / - are relatively small. ROC for the optimal k or I for each case are shown in (C) and (D). Gkm- 
SVMs (solid) consistently outperform kmer-SVMs (dashed) on both data sets. Error bars here and below represent 5-fold CV standard deviation. 
doi:l 0.1 371 /journal.pcbr.l 00371 1 .gOOl 



I. In contrast, the kmer-SVM accuracy drops rapidly as k 
increases. Moreover, although the difiFerence in performance is 
smaller than found for the CTCF data set, the gkm-SVM achieves 
the best AUC (0.947) with Z = 9 and A = 6, while the kmer-SVM 
achieves 0.932 with k = l , suggesting that loiig(;r /i-m(;rs with some 
flexibility do contain more complete information about TF 
binding (Figure ID). At the same time, the gapped ^-mer features 
are more robusdy estimated (having more counts) and for this 
reason make more reliable predictors. The consequences of these 
improvements in AUC are significant when considering the 



genome-scale precision of the improved gkm-SVM classifiers. The 
rate of false positive predictions is dominated by the large neutral 
fraction of the genome, so the precision of a genome-scale classifier 
is best assessed by a Precision-Recall curve in combination with a 
much larger negative set, as discussed in Ref [3]. The Precision- 
Recall curves for the gkm-SVM and kmer-SVM classifiers on a 
lOOx negative set are shown in Figure S3. For CTCF, at a recall 
of 50%, the precision increases from 36% to 59%. These ranges of 
precision and recall are in the relevant range of experiments 
aiming to discover and test novel enhancers, and we therefore 
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expect that predictions based on gkm-SVM will have up to a two- 
fold higher successful validation rate. 

One further modification can substantially reduce the compu- 
tational cost of using gapped ^i-mers with little degradation in 
performance. The algorithm using the k-mer tree data structure 
produces identical results to the direct evaluation of Equation (3), 
but typically is much faster when the number of mismatches, I - k, 
is smaller than four, and the number of training sequences is large. 
The k-me.r tree algorithm can be made even more computationally 
efficient, if we prune the traversal of tlu; trc'(% by ignoring any k- 
mcr pairs that have more mismatciies than a predetermined 
parameter, nimax- This provides an approximation to the exact 
kernel calculation, but the approximation error is usually 
negligible given that the coefficient h„ for large numbers of 
mismatches are generally much smaller compared to those with 
small m. This approximation significantly reduces the total 
number of calculations and allows the user to control the running 
time of the algorithm by setting the parameter nimaxt and makes 
the use of longer word lengths / feasible for any given k. To 
systematically investigate the classification performance of this 
approximation, we applied the same analysis above using both 
CTCF and EP3()0 data sets (Figure lA, B), and found that AUCs 
from the approximate method are virtually identical to the exact 
method when the difference between m„^ and I - k are small. 
Interestingly, the approximation method achieved even higher 
AUG with CTCF data set in some cases. 

Encouraged by the analyses of CTCF and EP300 data sets 
above, we systematically compared gkm-SVM to kmer-SVM using 
a very broad range of human data sets generated by the ENCODE 
project [17,18]. We used 467 sets of ChlP-seq peaks produced by 
the ENCODE uniform processing pipeline containing at least 500 
regions (see Methods). We truncated any data set with greater than 
5,000 regions by random sampling. We then trained both kmer- 
SVM and gkm-SVM on each set against an equal size (1 x) 
negative set of random genomic regions and calculated AUCs with 
five-fold cross validation. We used k — 6ioT kmer-SVM, and Z = 10 
and k = 6 for gkm-SVM, but as shown in Figure 1 the 
improvements are generally insensitive to these parameter choices. 
Strikingly, we find that gkm-SVM almost always outperforms 
kmer-SVM (Figure 2A). We also find that variances of AUCs from 
test CV sets are generally reduced, suggesting that gkm-SVM is 
more robust than kmer-SVM (Figure S4). More significantiy, gkm- 
SVM performs much better especially for TFs with long binding 
sites. In this dataset, most of these long binding sites arise in ChlP- 
seq data sets for CTCF and members of the cohesin complex 
(RAD21, SMC3) known to be physically associated with CTCF 
[19]. On these CTCF associated factors gkm-SVM exhibits much 
higher AUC than kmer-SVM, as highlighted by the cluster of 
purple circles in Figure 2A. We have also compared gkm-SVM to 
the best single PWM AUC as shown in Re£ [6] (Figure 2B). As 
expected, gkm-SVM outperforms all datasets except CTCF, for 
which gkm-SVM performance is only marginally reduced. For a 
consistent analysis of this dataset, we used / = 1 0 and k — 6, 
although for CTCF the gkm-SVM performance is optimal at 
larger I, as seen in Figure lA. 

Motif analysis of the ENCODE ChlP-seq data sets 

The predictive sequence features that allow gkm-SVM to 
outperform the single best PWM imply that cooperative binding is 
the underlying molecular mechanisms that targets TFs to these 
regulatory regions. Previously we have typically focused on a 
handful of the highest SVM weight ^-mers (say top ten positive 
and top ten negative weight k-mevs) to interpret the classification 
results [3,6,20]. This simple method becomes unwieldy when 



applied to the gkm-SVM results because of the large number of 
very similar significant features (when I and/or k are large). 
Although the ^-mers at the extreme top and bottom tails of the k- 
mer weight distribution are still important and biologically 
meaningful, those k-mers usually cover only a fraction of the 
significant feature set, and many more important features are 
included in the larger tails of the A-mer weight distribution. 
Therefore, more sophisticated algorithms are needed to extract the 
biologically relevant features from the classification results. 

To directly address this issue, we developed a new method to 
combine multiple similar k-nicis into more compact and 
interpretable PWMs and analyzed the 467 ENCODE data sets 
[18]. In this approach, we used a larger number of predictive k- 
mers to build de novo PWMs (see Methods). We used the top 1% 
of 10-mers from each of the gkm-SVMs trained on the ENCODE 
data sets and identified up to three distinct PWMs (Figure S5) from 
k-mers in this set. We then compared our results with the previous 
PWMs found in the same data sets using a conventional tool 
(MEME-ChIP) [18,21]. Similar to our approach, Wang et al. 
analyzed 457 ENCODE ChlP-seq data sets (440 sets are in 
common with those we analyzed above) and identified five PWMs 
from each data set. Collectively, Wang el al. found 79 distinct 
PWMs enriched, of which our method recovered 74. Comparing 
each ChlP-seq data set individually, we recovered most of the 
PWMs reported by Wang et al. using our method (Figure 2C). 
Interestingly, while Wang et al. largely failed to identify 
biologically meaningful PWMs from most of the POL2 ChlP- 
seq data sets (47 out of 58 sets returned no meaningful PWMs), our 
methods frequently identified cell-specific TFs as well as promoter 
specific TFs (Figure S5). For example, the GATAl TF identified 
from POL2 ChlP-seq in the erythroleukemic cell line K562 is 
known to play central roles in erythroid differentiation [22]. The 
ETS 1 TF from HUVEC is another extensively studied TF, known 
to be important for angiogenesis [23]. A major difference between 
the two methods is the number of training sequences. While the 
previous study was limited to the top 500 of ChlP-serj peaks 
(ranked by ChlP-seq signal), we were able to use 1 0 x larger 
numbers of ChlP-seq peaks (5,000 regions), and the large training 
sizes enabled us to robustly identify diverse combinatorial 
sequence features. 

Comparison to previous kernels 

Since the early development of A-mer based supervised machine 
learning techniques [24], there have been a number of improve- 
ments. Some of these extend the feature set to include imperfect 
matches, similar in spirit to our gkm-SVM. The mismatch string 
kernel [8] is one such method, originally motivated by the fact that 
homologous protein sequences are not usually identical and have 
many frequentiy mutated positions. The mismatch kernel also uses 
A-mers as features, but allows some mismatches when counting k- 
mers and building feature vectors. The wildcard kernel [9] is 
another variant of the original string kernel, which introduces a 
wildcard character that matches any single letter in the given 
alphabet. More recentiy, an alternative di-mismatch kernel [10] 
has been proposed to direcdy model TFBSs, and has been 
successfully applied to protein binding microarray (PBM) data sets 
[25] and several other ChlP-seq data sets [10,11]. The di- 
mismatch method tries to overcome the limitation of the mismatch 
kernel by favoring A-mers with consecutive mismatches. However, 
in a recent comparison of methods for modeling transcription 
factor sequence specificity, full k-mer methods outperformed the 
di-nucleotide approaches when applied to PBM data [26]. 

To further evaluate our proposed method, we direcdy compared 
the gkm-kemel with the aforementioned three alternative methods. 
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Figure 2. gkm-SVM consistently outperforms kmer-SVM and the best known PWiVI on human ENCODE ChlP-seq data sets. (A) We 

trained gkm-SVM and kmer-SVM on the complete set of 467 ENCODE ChlP-seq data sets (with /( = 6 for kmer-SVM, and / = 10 and /c = 6 for gkm-SVM). 
gkm-SVM AUC is consistently higher than kmer-SVM with only a few very minor exceptions. The gkm-SVM method specially outperforms the kmer- 
SVM for the data sets bound by members of the CTCF complex, highlighted as purple circles. (B) We also compared gkm-SVM and the best known 
PWM on the same data sets, and gkm-SVM AUCs are significantly higher than the PWM AUC in almost all cases. (C) The ENCODE data sets were 
divided into four groups: (1) no PWM, (2) only one PWM, (3) two PWMs, and (4) three or more PWMs identified by Wang ef al. Then, for each group 
except the first one, we calculated the number of PWMs recovered by our method. At least one PWM was recovered for more than —90% of the data 
sets. 

doi:1 0.1 371 /journal.pcbi.1 00371 1.g002 



Mismatch kernel [8], Wildcard kernel [9], and Di-mismatch 
kernel [10,11], using the mouse forebrain EP300 data set. As 
shown in Figure 3, gkm-kernel outperforms the other three 
existing methods both in terms of the classification accuracy and 
running time. The best AUC we achieved for gkm-kernel is 
0.947 as compared to 0.937, 0.935, and 0.944 for the wildcard 
kernel, mismatch kernel, and di-mismatch kernel, respectively 
(Figure 3A). Although the wildcard kernel and gkm-kernel are 
quite similar, the systematic improvement in gkm-kernel AUCs 
is primarily due to the incorporation of reverse complement 
sequences. We direcdy tested this by adding reverse complement 
sequences to the feature set for the previously published 
methods, and indeed found that with this modification, these 
methods were also able to achieve comparable AUCs (Figure 
S6). 

More significandy, when we compare running times at 
parameters which maximize AUC for each method, our 



gkm-SVM implementation (/ = 9,l-k = 3) is roughly two orders of 
magnitude faster than di-mismatch (10,3), and slightly more than 
one order of magnitude faster than mismatch (/= 10,m = 2) and 
wildcard (/ = 8,m = 3) on the EP300 data set (Figure 3B and 
Figure S7). Also, by fixing k = 6 and the parameter m,,^,^ in our 
algorithm, the AUC becomes less sensitive to the feature length I, 
compared to a scan at fixed m, varying k (Figure 3A). Direct 
running time comparisons using our tree structure in the 
mismatch and wildcard kernels (described below) are shown in 
Figure S7B and S7C. We should note that we were only able to 
test the di-mismatch kernel up to Z = 1 0, because it required more 
than 1 28 GB of memory and did not finish within 2000 minutes 
when using / = 1 1 . 

Interestingly, we also note that both Mismatch kernel and 
Wildcard kernel are special cases of the more general class of 
kernels, defined by Equation (3). This unification allows direct 
apphcation of the methods we developed for mismatch profile 
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Figure 3. Comparison of gItm-SViVI and existing methods on the mouse forebrain EP300 data set. (A) For each method, averages of 5-CV 
AUCs are shown as a function of the word length with the optimal number of mismatches, m, held fixed. Also shown are gkm-SVM results using fixed 
k = 6 and varying mmax- (B) Running time for each of the kernel computations shown in (A). Gkm-kernels show better classification performance and 
significantly more efficient computation at peak AUG. 
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computation and therefore gives more efficient methods for 
computation of these existing methods (see Methods). 

Calculation and performance of estimated /-mer 
frequencies for gkm-SVM 

As an alternative to the gapped A-mer feature set, we also 
developed an alternative kernel by replacing the k-mer counts with 
robust Z-mer count estimates [7] in our original kmer-SVM 
framework. We have developed efficient methods to compute this 
new kernel (see Methods). In Ref [7], we considered the mapping 
from /-mers to gapped k-mas. Among all possible sets of Z-mer 
frequencies that could produce the same gapped A-mer frequency 
distribution, we developed a metiiod to estimate the "most likely" 
Z-mer frequency set. FuU details of this method are described in the 
Ref [7]. In brief, we first define a gapped k-mer count vector 
flk = bl'J'Zj ■ ■ • ^Jm]^ similar to the definition of the gapped k-mer 
feature vector for gkm-SVM as shown above. Then, the count 
estimate, Xu, for Z-mer u is given by 

M 

Xu=^Wiyi (5) 

! = 1 

The weight vv, in Equation (5) was shown to only depend on the 
number of mismatches, m, between the gapped k-mer correspond- 
ing to and M, and takes the following form: 



where h is the alphabet size and is equal to four in case of DNA 
sequences (A,C,G and T). Since the above equation is applied to 
every Z-mer, it would provide a non-zero frequency even for an l- 
mer that does not have any exact match appearing in any training 
set sequence. 

Direct calculation of Equation (5), however, requires actual 
counting of aU of the M gapped A-mers, which becomes 
computationally intractable for large Z and k in a way similar to 
Equation (1). Besides, summing up a large set of floating point 
numbers may result in poor numerical precision. To overcome 
these issues, we developed a simple method, referred to as the gkm- 
filter, to more efficientiy calculate the robust Z-mer count 
estimates, Xu, without calculating the intermediate gapped A-mer 
counts (see Methods). In summary, in the calculation of the robust 
Z-mer count estimates, we give a non-zero weight to Z-mers with 
few numbers of mismatches. The k-mer frequency estimation 
method is not constrained to produce non-negative frequencies 
and may occasionally generate negative count estimates. To obtain 
strictly positive frequencies, we used a revised version of the gkm- 
filter method, which we call the truncated gkm-filter. Finally, we 
developed a method to directly calculate the kernels using these 
feature sets (see Methods). An important result here is that the 
evaluation of the gkm-kernel (the inner product of the Z-mer count 
estimates vectors) is stiU given by Equation (3), but with a new set 
of weights Cikiin) given by Equation (14), below, replacing hi^im). 
Therefore, efficient algorithms for pairwise mismatch profiles that 
we developed for the gkm-kernel can be directly used for this new 
feature set without any modification. Because of this symmetry, we 
also refer to this method as gkm-kernel with (full or truncated) 
filter. A numerical example using count estimates on two short 
sequences is provided in Text SI. 

To systematically compare the classification performance of 
these new methods with the original gapped k-mers, we repeated 
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the previous analysis with the ENCODE ChlP-seq data sets. Using 
the truncated gkm-filter yields results highly comparable to the 
original gkm-SVM for most datasets with modestly but consis- 
tently better relative performance when AUG is greater than 0.9 
(shown as purple circles in Figure S8A). Any improvement in the 
range of high AUG (>0.9) typically strongly reduces the classifier's 
False Prediction Rate [27], therefore, we generally recommend the 
truncated filter method as the method of choice for most analyses. 
Gompared to the original gkm-SVM, using the gkm-SVM with 
fuU filter yields lower AUCs (Figure S8B) although it is still 
significandy higher compared to the kmer-SVM method. 

Application of the robust /-mer count estimates for 

Naive-Bayes classifier 

So far, we have focused on using gapped A-mer based methods 
for improving sequence kernel methods. We have shown that, by 
direct use of gapped A-mers as features or by using the robust Z-mer 
count estimates, we can significandy overcome the long A-mers' 
sparse count problem for these methods. We further demonstrate 
the general utility of the robust /-mer count estimates in sequence 
classification problems by applying it to a simple Naive-Bayes (NB) 
classifier similar to the one previously introduced in Ref [28] and 
show that by using robust count estimates instead of conventional 
k-mer counts we can significantly boost the performance of the 
Naive-Bayes classifier for long A-mers. 

Ih-rc-, wc used the log-likcliliood ratio of the estimated Z-mer 
frequencies in the positive and negative sets as a predictor, using 
the NB assumption of feature independence. The prediction score 
of any given sequence of length n, denoted by S = S(^i...s„-i, is 
then given by: 

fNsiS) = 2_, log -TT— 7 (7) 

where Np and Nj^ axe the robust count estimates of the 
corresponding Z-mers, 5fSj+i...5,+;_i, in the positive and negative 
training set, and are given by Equation (11) below. We used the 
truncated gkm-filter method adding pseudo-count (half of the 
smallest positive coefficient of the truncated gkm-filter) to each of 
the estimated frequencies to obtain stricdy positive frequencies for 
log-likelihood ratio. As a comparison, we also implemented the NB 
classifier without the gkm-filter, using actual Z-mer counts with a 
pseudo-count (0.5) for Np and N^. We predicted the CTCF and 
EP300 genomic bound regions with both NB classifiers (i.e. with 
and without using robust count estimates). As shown earlier, 
genomic GTGF bound regions are almost perfectly predicted by 
the single GTGF PWM (Figure S2), and the local sequence 
features around the GTGF binding motif do not seem to 
significandy contribute to the prediction. Thus, to precisely detect 
the GTGF binding motif and achieve the best classification 
performance, we scored every substring of length n — 15+Z— 1 for 
each sequence and assigned the maximum as the final score for the 
sequence. The window size of 15 was chosen to optimize the 
detection of the GTGF site within a small window of flanking 
sequence, which maximizes the performance of the NB classifier 
without the gkm-filter. For the EP300 genomic bound regions, in 
contrast, we used the fuU sequence in both classifiers. We compare 
the performance of these NB classifiers on both data sets in 
Figure 4 for a range of feature length (6-20 bp). Similar to the 
previous analysis using gkm-SVM and kmer-SVM (Figure 1), 
using robust count estimates (gkm-filter) significandy improves the 
classification accuracy especially for longer k-mers (Figure 4). On 
the GTGF data set, the NB classifier using the gkm-filter achieves 



best performance with Z = 20 (AUG = 0.99), which is even better 
dian that of the GTGF PWM (red dotted line, AUG = 0.983) 
(Figure 4A). Also on the EP300 dataset, the gkm-filter significandy 
improves the overall performance of NB classifier (Figure 4B). The 
superior classification performance using gapped k-mer based 
features is thus consistent for both SVM and NB classifiers, and 
strongly suggests that the robust Z-mer count estimates provide a 
more complete and robust set of sequence features than simple k- 
mers in most sequence classification problems, as hinted at in our 
preliminary analysis of A-mer frequency spectra in Ref [7]. 

Discussion 

In this paper, we presented a significandy improved method for 
sequence prediction using gapped k-mers as features, gkm-SVM. 
We introduced a new set of algorithms to efficientiy calculate the 
kernel matrix, and demonstrated that by using these new methods 
we can significantly ()\crc()m(; the sparse k-mer count problem for 
long k-mers and hence significandy improve the classification 
accuracy especially for long TFBSs. Detailed comparisons of our 
proposed method with some existing methods show that our gkm- 
SVM outperforms existing methods in terms of classification 
accuracy on benchmark data and is also tyjoically orders of 
magnitude faster. We also introduced the concept of gkm-filters for 
efficient calculation of the robust A-mer count estimates and 
derived optimal weights for penahzing different number of 
mismatches. We showed that one could successfully replace k- 
mers with robust A-mer count estimates to avoid long A-mer sparse 
count problem, and demonstrated the effectiveness of this method 
by showing examples in SVM and Nai\'(;-Bay(-s classiiicrs. We thus 
expect that most A-mer based methods can be significandy 
improved by simply using this generalized A-mer count. 

The main biological relevance of the computational method we 
present in this paper is that gkm-SVM is capable of accurately 
predicting a wide range of specific classes of functional regulatory 
elements based on DNA sequence features in those elements 
alone. This in itself is interesting and implies that the epigenomic 
state of a DNA regulatory element primarily is specified by its 
sequence. In addition, our predictions facilitate direct investigation 
of how these elements function, either by targeted mutation of the 
predictive elements within the larger regulatory region, or by 
modulating the activity of the TFs which bind the predictive 
sequence elements. We are currendy using changes in the gkm- 
SVM score to systematically evaluate the predicted impact of 
human regulator)- variation (single nucleotide polymorphisms 
(SNPs) or indels) to interpret significant SNPs identified in genome 
wide association studies. We demonstrated that gkm-SVM is 
better at predicting aU ENGODE GhlP-seq data than the best 
single PWM found from the GhlP-seq regions, or previously 
known PWMs. The gkm-SVM is able to do so by integrating 
cofactor sequences which may not be directiy bound by the GhlP- 
ed TF but facilitate its occupancy. To predict this ChlP-seq set 
accurately required the improv(-d accuracy of the gkm-SVM and 
its ability to describe longer binding sites such as CTCF, which 
were very difficult for our earlier kmer-SVM approach. We 
recovered most of the cofactors found by traditional PWM 
discovery methods, but we further show that these combinations of 
cofactors are predictive in the sense that they are sufficient to 
define the experimentally bound regions. 

There are some further issues that need to be considered in the 
application of these methods. First, one wiU typically be interested 
in finding an optimal set of the parameters (/ and A) to achieve the 
best classification performance. A significant advantage of gapped 
A-mer methods over A-mer methods is that they are more robust 



PLOS Computational Biology | www.ploscompbiol.org 



8 



July 2014 I Volume 10 | Issue 7 | e1003711 



Gapped /t-mer SVM for Regulatory Elements 



B 



5 



d 



00 

d 



1^ 
d 



d 



ID 

d 



d 




CTCF 



I 

\ 

\ 

Naive Bayes with gkm-filter 
Naive Bayes with k-mers 
best PWM AUC (.983) 



"1 — \ — \ — \ — r 

10 12 14 16 18 
Word length 



d 



00 

d 



d 



d 



in 
d 



d 




EP300 



Naive Bayes with gkm-filter 
Naive Bayes with k-mers 
best gkm-SVM AUC (.947) 



20 



"1 — \ — \ — r 

10 12 14 16 
Word length 



18 20 



Figure 4. Gapped Ar-mer features also improve performance of Naive-Bayes classifiers. Naive-Bayes classifiers were trained on (A) CTCF 
bound and (B) EP300 bound genomic regions using different word lengths, k, using both actual k-mer counts (dashed), and estimated /c-mer counts 
from the gkm-filter (solid). As shown above for SVM, the Naive-Bayes accuracy as measured by AUC is systematically higher using gapped /c-mer 
estimated frequencies instead of actual /c-mer counts, further supporting the utility of gapped /c-mer based features. For CTCF the Naive-Bayes AUC is 
comparable to the best SVM (dotted red lines), but for EP300 the SVM outperforms the Naive-Bayes classifier. 
doi:1 0.1 371 /journal.pcbi.1 00371 1.g004 



and are less sensitive to the particular choices of / or A compared to 
kmer-SVM or NB classifiers, as shown in Figure 1 and Figure 4. 
Nevertheless, these parameters can stiU be optimized to maximize 
cross validation AUC. As a general rule, we have found that when 
choosing the parameter k, which determines how difiFerent 
numbers of mismatches are weighted, given a whole word length 
I, smaller values of k (typically less than 8) are usually better when 
important sequence elements are believed to be more degenerate 
or when only small amount of training data is available. Although 
the choice of k directly affects the feature set, our analysis of 
several datasets shows that the overall performance of the classifier 
is not very sensitive to changes in k. The parameter / is directly 
related to TFBS lengths and should be comparable to or slightly 
larger than the longest important feature, as demonstrated by our 
analysis of the CTCF and EP300 data sets in Figure 1 and 
Figure 4. 

Our approach also avoids an issue that would arise if one chose 
instead to directly use Equation (5) for computing count estimates. 
This would involve a large number of floating point operations, 
and accumulated round-off error could become significant in the 
large summations. There are some algorithms, such as Kalian 
compensated summation [29], which can significantly reduce this 
error, however, we explicidy avoided evaluating this sum by first 
computing the mismatch profiles between sequences, which 
involves only integer calculations. Then, we calculate the weighted 
sum of the number of mismatches using Equation (11), which 
involves a much smaller number of floating point operations. 

Two issues which are left for future investigation are diflerent 
treatment of end vs. internal gaps, and allowing imperfect 
mismatches. We currently do not make special consideration for 
gaps which occur at the end of a k-mer instead of internal gaps. 
Also, our implementation of a mismatch treats all nucleotides 
equally, but often TF binding sites can prefer an A or T in a given 
position, or a purine vs. pyrimidine pair. Our approach recovers 



these preferences by assigning different weights to A-mers which do 
not have gaps at these positions, but including a wider alphabet 
including (W,S,Y,R) for (AT,GC,AG,CT) may have some 
advantages. 

Throughout this paper, we have focused on using DNA 
sequences as features for classifying the molecular or biological 
function of a genomic region. However, in principle, our method 
can be applied to any classification or prediction problem 
involving a large feature set. In general, when the number of 
features used by a classifier increases, the number of samples in the 
training set for each point in the feature space becomes smaller, 
and small sample count issues occur (which we have resolved using 
gapped A-mers). One approach to the large feature space is feature 
selection, which selects a subset of features and builds a classifier 
only using those features, ignoring all the other features. However, 
usually a limited subset of features cannot explain all the variation 
in the predicted quantity. While hypothetical at this point, our 
analysis suggests that an alternative approach might be of general 
value. Analogous to the way we have used gapped A-mers to more 
robustiy estimate A-mer feature frequencies, we speculate that 
there may be a general approach which uses subsets of a larger 
feature set to combine observed feature counts with weights 
reflecting the similarity to some generalized feature. These 
estimated feature frequencies will be less susceptible to statistical 
noise by construction, and thus may provide consistently better 
classification performance, as we have shown for gapped A-mers. 

Methods 

Support Vector Machine 

The Support Vector Machine (SVM) [30,31] is one of the most 
successful binary classifiers and has been widely used in many 
classification problems. We have previously developed an SVM 
based framework, or "kmer-SVM", for enhancer prediction and 
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have successfully applied to embryonic mouse enhancers [3] and 
many other regulatory datasets [6,20]. Briefly, our kmer-SVM 
method finds a decision boundary that maximally discriminates a 

set of regulatory sequences from random genomic non-regulatory 
sequences in the k-mev frequency feature vector space. Here, we 
developed new kernel functions using gapped k-mers and /-mer 
count estimates as features, and software that calculates the kernel 
matrix. For SVM training, we developed a custom Python script 
that takes the kernel matrix as input and learns support vectors. 
We used Shogun Machine Learning Toolbox [32] and SVM-light 
[33] for the SVM training script. As an alternative method, we 
also implemented an SVM classifier based on the iterative 
algorithm described in Ref [34]. 

Direct computation of Gkm-I<ernel 

For direct computation of the gkm-SVM kernel matrix, we 
represent each training sequence with a list of Z-mers and 
corresponding count for each Z-mer. Then for each pair of 
sequences, we compute the number of mismatches for all pairs of l- 
mers and use the corresponding coefficient A„ to obtain the inner 
product of Equation (3). As the number of unique /-mcrs in each 
sequence is L and the number of sequences is A'^, this algorithm 
would require comparisons. In addition, a naive 

algorithm for counting the number of mismatches between two 
Z-mers (i.e. the hamming distance) would be 0(1). Our implemen- 
tation employs bitwise operators, providing a constant-factor 
speedup. Briefly, using two bits to represent each base (A,C,G and 
T), we used an integer variable to represent non-overlapping 

substrings of t base pairs of the Z-mer, therefore using total 

integers to represent each /-mer, where f •] is the ceiling function. 
For counting the number of mismatches, we take the bitwise XOR 
(exclusive OR) of the integer representations of the two Z-mers and 
use a precomputed look-up table to obtain the total number of 
mismatches using the XOR result. This method requires a look-up 
table of size 2^'. The optimal value of t depends on the processor 
architecture and amount of cache memory. We used i = 6 for our 
analysis. 

Gkm-kernel with k-mer tree data structure 

As depicted in Figure 5, we use a k-vav.r tree to hold all the l- 
mers in the collection of all of the sequences. We construct the tree 
by adding a path for every Z-mer observed in a training sequence. 
Each node ti at depth d represents a sub-sequence of length d, 
denoted by s(f,i), which is determined by the path from the root of 
the tree to the node Each terminal leaf node of the tree 
represents an Z-mer, and holds the list of training sequence labels in 
which that Z-mer appeared and the number of times that Z-mer 
appeared in each sequence. As an example, Figure 5 shows the 
tree that stores aU the substrings of length Z = 3 in three sequences 
Si=AAACCC, S2 = ACC, and S3 = AAAAA. Then, to evaluate 
the mismatch profile we traverse the tree in a depth-first search 
(DFS) [35] order. In contrast to the mismat[:h tree used in Ref. [8] , 
here for each node ti, at depth d, we store the list of pointers to all 
the nodes tj at depth d for which sffij and s(f,j) have at most I - k 
number of mismatches. We also store the number of mismatches 
between ,?(/,) and s{lj). Similar to the mismatch tree [8], we do not 
need to store these values for all the nodes in the tree, but we 
compute them recursively as we traverse the tree. When reaching 
a leaf node, we increment the corresponding mismatch profile 
N„(Si, Sjj for each pair of sequences Si in that leaf node's sequence 
list, and all the Sfs in the list of sequences in the pointer list for that 
leaf node. At the end of one DFS traversal of the tree, the 



mismatch profiles for all pairs of sequences are completely 
determined. 

To increase the speed further, we also introduce an optional 
parameter m^^x, which limits the maximum number of 
mismatches. By setting smaller than I - k, we only consider 
Z-mer pairs that have at most m,,^^. number of mismatches. This 
can reduce calculation significantly by ignoring Z-mer pairs which 
potentially contribute less to the overall similarity scores. This 
method provides fast and efficient approximations of the exact 
solution. In addition, we only compute the lower triangle of the 
matrix because of the symmetry in tlu; kernc-l matrix. Hence, at 
each node ti, we exclude the nodes tj in the list that have 
maxID(^j)<minID(<^), where minID(y and maxID(^y) are the 
maximum and minimum sequence ID in the subtrees of /, and tj 
respectively and are computed and stored for each node at the 
time we buUd the tree. 

Analysis of de novo PWMs from gkm-SVM 

We developed a new method for building de novo PWMs by 
systematically merging the most predictive k-mers from a trained 
gkm-SVM. We first determined a set of predictive k-mers by 
scoring aU possible 10-mers and selecting the top 1% of the high- 
scoring 10-mers. We then found a set of distinct PWM models 
from these predictive 10-mers using a heuristic iterated greedy 
algorithm. Specifically, we first built an initial PWM model from 
the highest scoring 10-mer. Then, for each of the remaining 
predictive 10-mers, we calculated the log-odd ratios of aU possible 
alignments of the 10-mer to the PWM model, and identified the 
best alignment (i.e. the position and the orientation that give rise to 
the highest log-odd ratio value). Since multiple distinct classes of 
TFBSs are expected to be identified in most cases, we only 
considered 10-mers with good alignments (i.e. we used threshold of 
5.0 for log-odd ratio scores relative to a genomic GG = 0.42 
background). After each of the 10-mers was aligned, we updated 
the PWM model only with successfully aligned 10-mers. To 
further refine the PWM, we repeated this by iterating through aU 
of the top 1% 10-mers until no changes were made. When 
updating the PWM model, we assumed that the contribution of 
each ^-mer is exponentially weighted proportional to its SVM 
score, using exp(Q( Wj), with a = 3.0. The 10-mers used for creating 
the 1 PWM were then removed from the list, and the process was 
repeated on the remaining predictive k-mers, to find up to three 
PWMs. Lastly, we matched our PWMs to the previously identified 
PWMs [18] using TOMTOM [36] software. Each of die PWMs 
identified by our method were associated with Ref [18] PWMs if 
the ^'-value (false discovery rate) <0.05. 

Implementation of mismatch and wildcard kernels using 
gkm-kernel framework 

In the gkm-kernel, we define the feature vector to consist of the 

frequency of all the Z-mers with exactly k known bases and I - k 
gaps. In contrast, the wildcard kernel [9] also includes aU the Z- 
mers with I — k wildcards, where I - k ranges from 0 to the 
maximum number of wildcards aUowed, M. Thus in the wUdcard 
kernel, the parameter M replaces k in our gkm-kernel. In the sum, 
these are weighted by X' " * to penalize sequences with more 
wildcards [9]. We derived an equation to directly compute the 
inner products from the mismatch profiles without the need to 
calculate the actual gapped A-mer counts. Here we show that a 
simUar approach can be used to calculate the wUdcard kernel. We 
derive a new set of coefficients hfj^(m) that can substitute h„„ in 
Equation (3). To evaluate h'l'^{m) we only need to consider the 
contribution of each pair of Z-mers with m mismatches in the inner 
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DFS(node, depth, {(node, mismatches)}) 
DFS(to , 0, {(to, 0)}) 

DFS(ti , 1, {(ti, 0), (t2, 1)}) 

DFS(t3 , 2, {(t3, 0), (t4, 1), (ts, 2)}) 

DFS(t6 , 3, {(te, 0), (t„ 1), (ts, 2), (tg, 3)}) 
DFS(t7,3, {(te, l),(t7, 0),(t8, l),(t9, 2)}) 

DFS(t2 , 1, {(ti, 1), (t2, 0)}) 



Figure 5. Fast computation of mismatch profiles using Ar-mer tree structure. As an example, we use 1=3 and three sequences S, = AAACCC, 
5^ = AAAAA, and 53 = ACC to build the k-mer tree. The leaves (nodes at depth d = / = 3) correspond to 3-mers AAA, AAC, ACC, and CCC. The sequence 
ID and the number of times each 3-mer appeared in each sequence are stored for each leaf. Each node f, at depth d represents a sequence of length 
d, denoted by s(f,), which is determined by the path from the root of the tree to f,-. For example, 5(12) = C and sCf^) = AC. DPS is started at the root node, 
to- When visiting each node f,-, at depth d, we compute the list of all the nodes tj at depth d for which s(f,) and s{tj) have at most m^ax mismatches. We 
also compute the number of mismatches between s(f,) and s(tj). When reaching a leaf, we increment the corresponding mismatch profile Wm(S„ Sy) for 
each pair of sequences S, in that leaf and Sj in the list. 
doi:1 0.1 371/journal.pcbi.1 00371 1 .g005 



product of the corresponding feature vectors of the two sequences. 
Equation (8) gives those weights: 

k' = l-M k' = l-M \ «^ / 

Using the above form allows us to directly use the fast 
algorithms we have developed for calculation of the mismatch 
profiles to calculate the wildcard kernels. Although there are 
similarities between our tree algorithm and the tree algorithm 
described in Re£ [9], there are some key differences. In the Ref. 
[9], the algorithm literally transverses all the possible gapped l- 
mers (with maximum M number of gaps) while our algorithm 
takes advantage of the fact that the final inner product will only 
depend on the number of pairwise mismatches and hence only 
traverses all the Z-mers that are present in the input data. Another 
difference is that Ref. [9] uses a list of all partially matching Z-mers 
at each node of the tree, while we use a list of pointers to tree 
nodes instead. So, for example, at the beginning of the algorithm 
(at depth d = Q) they start with a large list consisting of all the 



possible Z-mers in the input data, while in our algorithm the list at 
depth d = Q consists of only one node (the root of the tree). Using 
this repres(;ntation of all the partially matching Z-mers, we can 
more efficiendy perform the comparisons at each step of the 
algorithm when the tree is dense. 

In the mismatch string kernel described in Ref [8] and [9], the 
feature vectors consist of the counts for all the Z-mers with 
maximum distance M from the Z-mers in the sequence. Here we 
show that the approach above can be used to implement the 
mismatch kernel. Again, the only difference is in the set of weights 
used in Equation (3). To calculate the mismatch string kernel value 
for two sequences we replace /»«(?») in Equation (3) by h'J"^'"'^''{m): 

M M M /l_m\ /^\ / \ (9) 

mi=Om2=or=oV W \ / \mi-t-rj 

where h is the alphabet size (ft = 4 for DNA sequences) and r = mi+ 
m2—m — 2t. Given two Z-mers Xi and X2 where Xi and X2 differ in 
exactly m places, the term inside the summations counts the 
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number of all possible i-mers that exactly differ Xi in Wi places and 
X2 in OT2 places t of which fall in the common bases of Xi and 
X2- (See Figure S9). So the result of the summation is the number 
of all /-mers that differ Xi and in at most M places. This form for 
the mismatch string kernel has the advantage that we can direcdy 
use equation (3) to compute the kernels by only having the 
mismatch profiles that can be computed more efficiendy. 

Gkm-filters for computation of the robust /-mer count 
estimates 

To compute the /-mer count estimates by using Equation (5), 
one should first calculate the gapped k-mev counts, 3),, and then use 
Equation (5) to combine the Ji with a weight corresponding to the 
number of mismatches, given by Equation (6). This is shown 
schematically in Figure SIO. The mapping from observed /-mer 
counts to gapped k-mer counts is performed by the matrix A, 
whose elements are Oy. If the gapped k-mer Vi matches /-mer uj, 
then ttij = 1, otherwise ay = 0. There is a second matrix W, which 
performs the mapping from gapped h-mer counts to eslimaled l- 
mer counts, and whose elements are Wy. In a previous work we 
showed that matrix W is the Penrose-Moore pseudo-inverse of ^ 
[7]. The element Wy only depends on the number of mismatches 
between the /-mer Ui and the gapped k-mer Vj, and is given by 
Equation (6). Here we show that, for efficient computation, we can 
combine the two mapping matrices, A and W, and direcdy 
calculate the minimum norm /-mer count estimates from actual /- 
mer counts in a sequence. We refer to this combined mapping as 
the gkm-filter. The combined mapping matrix G=WA, has 
elements gy, shown on the bottom of Figure SIO. As shown 
below, gij also only depends on the number of mismatches, m, 
between the /-mers n, and Uj. We denote these values by gikijn) 
and refer to this as the gkm-filter since the domain and range of 
this mapping is the same. 

To obtain the element gik(w,), that gives the weight for the 
contribution of an observed /-mer Ui in the training set to the 
minimum norm /-mer count estimate Uj that has exactly ni 
mismatches with Ui, we sum over the contribution of all the 
gapped A-mers that match M;. Note that ay = 0 for all other 

gapped k-mers. There exist '^^ different gapped k- 

mers that match M, and lia\ c- exactly m mismatches with iij. Figure 
SI 1 shows how we enumerate all these gapped k-mevs. The black 
sohd circles denote the m mismatch positions of m, and Uj, the gray 
circles denote the I - m match positions and the empty dotted 
circles denote the I - k gap positions. For a gapped k-mer to have 



exactly / mismatches with My, there are 



ways to select the / 



mismatch positions and 



l — m 

k-t 



ways to select the k - I match 



explained above (Figure SI 1). It can be easily shown thatg'ft(OT) is a 
polynomial of degree k in m. Now using the weights given in 
Equation (10), for any given /-mer, u we finally obtain the 
minimum norm /-mer count estimate as follows: 



x= ^Ntr(u,m)gik{m), 



(11) 



positions. Now considering the weight w(f) for the gapped A-mers 
with t mismatches, the gapped A-mer filter elements, §■»(«) can be 
obtained as follows: 



where NtJji, m) is the number of /-mers with exactiy m mismatches 
with u in the training set. For large values of / and k, the number of 
all possible gapped A-mers gets exponentially large and since this 
method avoids evaluating the gapped k-mer counts, it significandy 
reduces the cost of calculating the /-mer count estimates compared 
to the original method we developed in Ref [7] . 

In summary, we defined a generalized k-mer count (referred to 
as the robust /-mer count estimates) by giving a non-zero weight to 
/-mers with few number of mismatches (In the conventional A-mer 
count only perfectiy matching A-mers are counted). These weights 
are given by gik(ni). Figure S12A shows the plots for gikim) for 
1 — 20 and various values of k. Each plot is normalized so that 
weight corresponding to zero mismatches is equal to one. The case 
with l = k is equivalent to the conventional A-mer count. Also 
Figure S 1 2B shows gik^jn) for k = 6 and various values of /. With a 
frxed length /, higher values of A result in smaller coefficients for 
larger mismatches, and therefore less smoothing of the estimated 
counts (Figure SI 2). Moreover, g'a(m) can become slighdy negative 
for large numbers of mismatches. This is because in our estimation 
of the frequencies we did not restrict the frequencies to be positive, 
and doing so would yield a more complicated expression. The 
assumed Gaussian distribution allows non-physical negative 
frequencies to have non-zero probability. A beta-distribution 
would not have this problem but would introduce offsetting 
comphcations. In cases where the estimated counts are required to 
be strictly positive, such as when we need to calculate the 
logarithm or ratios of the estimated frerjuencies, we truncate the 
gkm-filter gih^tn) l)y setting gikin) — 0 for every- in^mo, where Mq is 
the smallest number of mismatches for which gik(jno)<0. This 
win give an approximation to the value of x in Equation (5), so it 
will no longer strictiy be the minimum norm estimate, but it wUl 
guarantee that all the count estimates are non-negative. 

Gkm-kernei with /-mer count estimates 

Given a sequence S, we define an /-mer count estimate vector 
= (Xpxf , ■ ■ ■ ,xf/)^ where is the number of all /-mers (4 in 
case of DNA sequences), and xf is the estimated count of the /- 
mer appearing in sequence S using Equation (11). Then, we can 
calculate a standard linear kernel simply by using this vector in 
Equation (1). Similar to the gkm-kemel method, we can further 
simplify this equation using the same technique introduced in 
Equation (2) which does not involve the computation of individual 
/-mer estimates. We show that the inner product of the two /-mer 
count estimate vectors can be obtained as follows: 



gik(m) = ^ 



t=0 



l — m 
k-t 



w{i) 



(10) 



In other words, there are 



different ways we can 



l — m 

k-t ) \ t 

construct a gapped A-mer that matches M;, and has exactly / 
mismatches with Uj, by selecting t positions from the m mismatch 
positions and k — t positions from the I - m match positions as 



</-^i/^2>=^^c(«fi,«;2) 
i=i j=i 



(12) 



where rii and ^2 are the number of /-mers in Si and S2, and wf ' is 

the z'th /-mer in and u^^ is thej'th /-mer in S2. If Mi and M2 have 
exactly m mismatches then c(mi, M2) = c^. Grouping all the /-mer 
pairs with m mismatches, we can rewrite Equation (12) as follows: 
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/^2y=J2 N^(Si ,S2)cik{m) 



(13) 



m = 0 



vvlicrc N,„{Si, S2) is the mismatch profile of Si and as previously 
defined in Equation (3). We show that the weight Cik(ni), denoted in 
short by c„, can be obtained as: 



ENCODE ChlP-seq datasets 

The ENCODE ChlP-seq datasets were downloaded from ftp:// 
ftp.ebi.ac.uk/ pub/ databases/ ensembl/ encode/integration_data_ 
jan20 1 1 /byDataType/ peaks/jan20 11/ spp/optimal/hub/. 

Implementation and source code 

We have implemented these algorithms in C-H-, and the source 
code and executable files are available on our website at http:// 
www.beerlab.org/gkmsvm/. 



ZEE 

mj m2 t 



mi-t\ (14) 
(b-l)'ib-2yg„\g^^ 

m\—t I \ r 



mismatches for U2 there are 



positions of Mi and Mj is given by 



ways to select the positions 



where r = mi-HW2~2<— m, h is the alphabet size. The summations 
are taken over the range 0 to /. Figure S 1 3 shows how we obtained 
the equation for c„, similar to the previous development shown in 
Figure SI 1. Given two Z-mers U\ and Mj, with m mismatches and / 
~ m matched positions, we want to enumerate the number of all 
possible /-mers, u, that have mi mismatches with Ui and in2 
mismatches with M2. For this, we assume that t of the ftii 
mismatches are among the I - m match positions and mi - t ot 
them are among the m mismatch positions. There are 

/ — tH \ I l¥l \ 

t ) V w tj ^^y^ *° choose these mi positions and (6 ~ 

1)* choices for the values of the I mismatches. These t mismatches 
plus the m—{mi—t) unselei:ted mismatch positions also do not 
match U2. Then, for the remaining r = m2~{t+ni—(mi—t)) 

mi—t 
r 

and {h - 2f ways to select the values. Hence the total number of /- 
mers, u with mi mismatches with Ui and m2 mismatches with U2, 
where t of the mismatches of u 1 and u are among the I - m match 

l—m\f m \/mi—t 
t )\mi-t)\ r 

{b-\)'(h-ir. 

Using matrix notation, we can further show that c^—gm if we 
use the fuU filter gikiw). To see this, note that 
</-Si^S2>=jSiTyS2=((jxj)T(5x2=xiGTGx2 where xi and X2 
are the l-mer count vectors for Si and 82- Given G = WA, we have 
G^G = (WAf WA=WAWA=WA = G. Hence, 
(f^^ J'^^y = XlGx2. Here A is the binary incidence matrix that 
maps l-mer counts to gapped k-mer counts as defined in Ref [7] 
and W is the Moore-Penrose pseudo-inverse of A . Note that this 
result does not hold for the truncated fdter g^. In that case, we 
directly use Equation (14) to obtain c„ coefficients. 

ROC curves 

To compare the performance of different classification methods, 
we calculated the area under the receiver operating characteristic 
(ROC) curve for each classifier. To plot the ROC curves and 
calculate area under the curves (AUCs) we used the ROCR 
package [37] in R. 

Cross validation 

Following standard five-fold cross validation procedures, we 
divided the positive and negative sets into five segments, left one 
segment out as the test set and used the other four segments for 
training. We repeated for all of the five segments and calculated 
the mean and standard error of the prediction accuracy on the test 
set elements. 



Supporting Information 

Figure SI PWM model for CTCF binding sites. CTCF 
specifically binds to a set of very long sequences via its eleven zinc 
fmgcr domains, which can be effectively modeled by a PWM. This 
CTCF logo was directiy obtained from the JASPAR database [38] 
(available at http://http://jaspar.cgb.ki.se/). 
(PDF) 

Figure S2 Classification results for CTCF binding using 
the CTCF PWM. The CTCF bound regions and the 
corresponding negative regions were scored by the CTCF PWM 
and the best log-odd score for each sequence was then used to 
calculate the ROC curve. Extremely high AUG was achieved, 
indicating that CTCF binding is well-modeled by the PWM. 
(PDF) 

Figure S3 Precision of gkm-SVM is significantly higher 
than kmer-SVM. To extrapolate to larger negative sets, we re- 
trained both gkm-SVM and kmer-SVM on each of the positive 
data sets (CTCF and EP300) against a 10 x larger negative set. We 
independentiy selected the parameter I and k which exhibited the 
best performance when trained on the 1 x negative set as shown in 
Figure 2G and D. We additionally applied mr^a — 3 for efficient 
computation of the gkm-kernel matrix. In contrast to standard 5- 
fold cross validation, we scored a much larger negative set (100 x) to 
obtain more realistic precision recall cur\'e (similar to genome-wide 
prediction), and plotted ROC curves (A and B), and Precision- 
Recall (PR) curves (C and D). In all cases, gkm-SVM significantly 
outperforms kmer-SVM, although the difference is much smaller 
for EP300. At recall = 50%, gkm-SVM for CTCF achieves 59% of 
precision while kmer-SVM achieved only 36%, suggesting that 
gkm-SVM has an almost two-fold lower false discover)- rate. Even 
for EP300, the precision of gkm-SVM at recall = 50% is significantly 
higher than kmer-SVM (35% vs. 28%). 
(PDF) 

Figure S4 gkm-SVM is generally more robust than 

kmer-SVM. We calculated standard deviation (SD) of the AUCs 
from the test CV sets for both gkm-SVM and kmer-SVM. In most 
cases, gkm-SVM AUG SD is significantly smaller than kmer-SVM 
AUG SD. 

(PDF) 

Figure S5 de novo PWMs fi-om gkm-SVMs trained on 
the 467 ENCODE ChlP-seq data sets. The name of the best 
matching known PWM (from Wang et al.) was assigned to each of 
the PWMs. 

(PDF) 

Figure S6 Classification results on the mouse forebrain 
EP300 data set with various methods. For each of the 
methods, we examined combinations of the parameters and 
measured AUCs with 5-GVs. Red dash-dotted line in each plot 

denotes the best AUG achieved by gkm-kernel with / = 9 and A = 6. 
(A) Mismatch kernel for ^ = 6~12 and m=l~2 using original 
implementation. Note that we obtained the kernel of (A, m) = (6, 2) 
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using our implementation due to the prohibitive computing time 
of the original method. (B) The previous experiments were 
repeated using our new implementation with the "adding reverse 
complement sequences option" enabled. (C) Wildcard kernel for 
k-6~l2 and m-l~'i with A - I. Note that we obtained the 
kernels with k = 6, m = 2~3 using our implementation. (D) The 
previous experiments were repeated using our new implementa- 
tion with the reverse complement sequence option enabled. (E) di- 
mismatch kernels for ^ = 6~10 and m=l~3. (F) The previous 
experiments using only the top 1000 most discriminative features 
as recommended in the original study. 
(PDF) 

Figure S7 Comparisons of numing times between 
different methods. (A) gkm-kernel, (B) mismatch-kernel, (C) 
wildcard-kernel, and (D) di-mismatch kernel. For mismatch and 
wildcard we also show results using our tree structure (dashed). For 
consistency, we used a single machine equipped with Intel Core 15- 
2410M (2.30 GHz) processor and 6 GB RAM, except di- 
mismatch kernels. Due to the prohibitive memory requirement 
of the di-mismatch kernels for large k, we separately measured the 
running times on different machines. 
(PDF) 

Figure S8 Comparison of different filters in gkm-SVM 
on human ENCODE ChlP-seq data sets. We compared the 
performance of the gkm-SVM using the gapped kmers as features 
(gkm-SVM) to gkm-SVM using Z-mer count estimates (A) with 
truncated gkm-filter and (B) with full gkm-filter. We used /= 10 
and k = 6 for all the methods. Those data sets where using the 
truncated-filter gives higher AUCs are marked as purpk- circles. 
The truncated filter method is marginally but systematically better 
when AUG is greater than 0.9. 
(PDF) 

Figure S9 Deriving the weights for calculation of the 
mismatch kernel using Equation (3). xi and xj differ in m 
places, u differs xi in mi places and X2 in m2 places, t of the u 
mismatch places are among the 1 - m ,xi, xj common places. 



There are 



l — m 
t 



{b-iy 



{b-Tfi "' / J such 1- 



mers as u. We sum over all OSmi, mg, tSM. 

(PNG) 

Figure SIO Block diagrams of the proposed method for 
the gkm-filter. (Top) Gapped k-mec counts are obtained from l- 
mer cotmts in the training set. Then minimum norm Z-mer count 
estimates are obtained from the gapped ^-mer counts. The the 

elements of the incidence matrix. A, that maps the /-mer counts in the 
training set to the gapped k-mex counts. a,j = 1 if gapped A-mer Vi 
matches /-mer Uj and is zero otherwise. w,j's are the elements of the 
matrix W (the pseudo-inverse of A) mapping gapped ^-mer 
frequencies to estimated Z-mer frequencies. (Bottom) We combine 
the two mapping matrices A and W to directly calculate the minimum 
norm Z-mer count estimates from the Z-mer counts in the training set. 
gij^ are the elements of matrix G mapping the /-mer counts in the 
training set to the minimum norm /-mer count estimates. 
(PNG) 



Figure Sll Enumeration of gapped A-mers with exactly 
t mismatches. Given the Z-mers M, and My, the number of 
different ways we can construct a gapped A-mer that matches Mj, 

l — m^ 
k-t 



and has exactiy / mismatches with m, is 



there are 



ways to select the t mismatch positions and 



k-t 



ways to select the k-t match positions. The black solid 



circles denote the m mismatch positions of m, and Uj, the gray 
circles denote the I - m match positions, and the empty dotted 
circles denote the I - k unselected (gap) positions. 

(PNG) 

Figure SI 2 Plot of giic(tn). Plot of the normalized filter 
function gik{vi) for (A) / = 20 and various values of k and (S) k = 6 
and various values of /. 
(PNG) 

Figure S13 Enumeration of 2-mers with mj and 
mismatches. Given two Z-mers Mi and M2, with m mismatches 
and l-m matched positions, we want to enumerate the number of 

all possible /-mers, u' , that have JBi mismatches with M] and wij 
mismatches with M2. For this, we assume that I of the nii 
mismatches are among the I - m match positions and wii — / of 
them are among the m mismatch positions. There are 
I — Tfi \ I in \ 

ways to choose these ttii positions and 



ways to select the 



/)) mismatches for U2 there are 



t J \mi—t 

{b—\)' choices for the values of the / mismatches. These t 
mismatches plus the {m — (mi—l)) unselected mismatch positions 
are also mismatches for 1/2. For the remaining r = m2—{t-Hn — {mi — 

mi—t 
r 

positions and {b — lf ways to select the values. Hence the total 
number of Z-mers, m', with mi mismatches with Mi and OT2 
mismatches with M2, where t of the mismatches of Mi and u' are 
among 

the (/ - m) match positions of Mi and M2 is given by 
l — m\ / m 
t J\mi- 

PNG) 



.^)("'7')(^'-im-2r 



Text SI 

(PDF) 



Numerical example. 
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